Remember to create a working directory, and use an R script in Rstudio to type your command lines, as we covered in our previous workhop
Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate types of single-cell data. After this short introduction workshop you can read Seurat offical website to dive a bit deeper.
SeuratData is a mechanism for distributing datasets in
the form of Seurat objects using Râs internal package and data
management systems. It represents an easy way for users to get access to
datasets that are used in the Seurat vignettes.
(If you have installed them before workshop, you do not need to run this block of code.)
install.packages('Seurat')
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
remotes::install_github("satijalab/seurat-data", quiet = TRUE)
We will use Seurat V5, which was published last year. Seurat V5 has gradually gained popularity due to its faster running speed. However, Seurat V5 has some data structure changes compared with older versions (V3 & V4), which may cause some old codes to fail to run. More details can be found on this website.
library(Seurat)
options(Seurat.object.assay.version = "v5")
library(SeuratData)
We will use the pbmcsca dataset. This public dataset
includes single-cell RNA-seq data of human peripheral blood mononuclear
cells (PBMCs) using multiple sequencing platforms. Only data that
passed quality control are included in the pbmcsca
dataset.
data("pbmcsca")
pbmcsca <- UpdateSeuratObject(pbmcsca)
(You can ignore the warning message).
table(pbmcsca$Method)
##
## 10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B 10x Chromium (v3)
## 3362 3222 3222 3222
## CEL-Seq2 Drop-seq inDrops Seq-Well
## 526 6584 6584 3773
## Smart-seq2
## 526
The table indicates the number of cells sequenced using different platforms.
In this workshop, we will consider two scRNA-seq data sets generated from two platforms, 10x Chromium (v2) & 10x Chromium (v3). PBMCs were sequenced from two patients.
The raw count matrix and the information of each gene and each cell
are saved in a Seurat object pbmc_10x_v2 and
pbmc_10x_v3 independently. In addition, we combine the two
sequencing results without any processing and store them in the Seurat
object pbmc_combo:
pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]
The Seurat object is a representation of single-cell expression data
for R; each Seurat object revolves around a set of cells and consists of
one or more Assay objects, or individual representations of expression
data (eg. RNA-seq, ATAC-seq, etc). These assays can be reduced from
their high-dimensional state to a lower-dimension state and stored as
DimReduc objects.
Seurat objects also store additional metadata, both at the cell and feature level (stored within individual assays). The object is designed to be as self-contained as possible, and easily extendable to new methods.
We use Seurat object pbmc_10x_v3 as an example.
The raw count matrix of scRNA-seq experiment is here:
pbmc_10x_v3@assays$RNA
â Count matrix in Seurat A count matrix from a Seurat object displays the genes in rows and the cells in columns.
The information and labels of each cell is here:
pbmc_10x_v3@meta.data
The dimension reduction information is stored here:
pbmc_10x_v3@reductions
Letâs start with a simple case: the data generated using the the 10x
Chromium (v3) platform (i.e the Seurat object
pbmc_10x_v3.
Letâs first take a look at how many cells and genes passed Quality Control (QC).
â Count matrix in Seurat A count matrix from a Seurat object displays the genes in rows and the cells in columns.
dim(pbmc_10x_v3)
## [1] 33694 3222
Here we have 3,222 cells with 33,694 genes that passed QC.
We use the Seurat function NormalizeData() to normalize
raw counts. By default, Seurat implements a global-scaling normalization
method called LogNormalize that normalizes the gene
expression measurements for each cell by the total number of counts
accross all genes, and multiplies this by a scaling factor (10,000 by
default), then log transforms the result:
pbmc_10x_v3 <- NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize",
scale.factor = 10000)
We use the Seurat function FindVariableFeatures() to
select highly variable genes which have most of useful information for
downstream analysis.
Here we select the top 3,000 most variable genes to save some computing time. In practice, you can select more genes (5,000 or more) to preserve more information from the scRNA-seq experiment:
pbmc_10x_v3 <- FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)
The single cell dataset likely may contain âuninterestingâ sources of variation, for example technical noise, batch effects, or even biological sources of variation (cell cycle stage). As suggested by Buettner et al, 2015, regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering.
Seurat constructs linear models to predict gene expression based on
user-defined variables. The scaled z-scored residuals of these models
are stored in the scale.data slot, and are used for
dimensionality reduction and clustering.
We use the Seurat function ScaleData() to obtain the
scaled matrix:
pbmc_10x_v3.all.genes <- rownames(pbmc_10x_v3)
pbmc_10x_v3 <- ScaleData(pbmc_10x_v3, features = pbmc_10x_v3.all.genes)
We perform PCA on the scaled data. By default, the genes in
pbmc_10x_v3@var.genes are used as input, but they can be
defined by specifying the argument pc.genes.
Performing dimensionality reduction on highly variable genes can improve performance. However, with UMI data â particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.
We run PCA on top 3,000 most variable genes:
pbmc_10x_v3 <- RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))
## PC_ 1
## Positive: LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA
## CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1
## CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130
## Negative: IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A
## TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6
## CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2
## PC_ 2
## Positive: NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC
## CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4
## PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X
## Negative: EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0
## PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1
## CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3
## PC_ 3
## Positive: CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH
## FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247
## MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A
## Negative: CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74
## IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B
## P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19
## PC_ 4
## Positive: IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93
## IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E
## SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2
## Negative: FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10
## MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3
## GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC
## PC_ 5
## Positive: GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW
## CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM
## HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1
## Negative: IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP
## CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2
## NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3
The PCA result is stored in pbmc_10x_v3@reduction. The
output information tells us which genes are positively, or negatvely
correlated with the top 5 principal components.
â Choosing parameters in PCA (genes and pcs) How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue that is out of the scope of todayâs workshop.
But we highly recommend you to read this document before analyzing your own scRNA-seq dat, where the authors show how to use some visualization methods to guide your choice.
We run the t-SNE algorithm first. It is calculated based on PCs, we use top 30 PCs in this example:
pbmc_10x_v3 <- RunTSNE(pbmc_10x_v3, dims = 1:30)
We can then draw a t-SNE plot by using the Dimplot()
function by specifying the argument reduction = 'tsne':
DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE)
We can colour points (cells) by using other information by specyfing
the argument group.by. For example, to display the
sequencing platform:
DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Method')
Here we have cells sequenced with only one sequencing platform (10x Chromium v3).
Exercise 1: Plot different t-SNE with a different number of PCs. Do you observe any changes?
Solution here
pbmc_10x_v3 <- RunTSNE(pbmc_10x_v3, dims = 1:5)
DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Experiment')
*â Interpreting t-SNE results ** As presented earlier, t-SNE interpretation must be taken with caution. Nowadays using UMAP visulization is more appropriate.
UMAP is also based on PCs. Here we use the top 30 PCs:
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims=1:30)
We draw the UMAP plot with the Dimplot() function by
specifying the argument reduction = 'umap'. We can specify
the argument group.by to color the cells according to
platform or experiment name.
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Experiment')
Exercise 2: Plot different UMAP with a different number of PCs. Do you observe any changes?
Solution here
By default, Seurat uses the Louvain algorithm.
The Louvain algorithm requires a neighbor graph as input. Therefore,
we first run the FindNeighbors() function first. Note that
FindNeighbors() is also based on PCs, here we use all 30
top PCs:
pbmc_10x_v3 <- FindNeighbors(pbmc_10x_v3, dims = 1:30)
We then run the FindClusters() function for clustering.
The argument Algorithm=1 means we are using the Louvain
algorithm for clustering:
pbmc_10x_v3 <- FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3222
## Number of edges: 134293
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds
Other options are available such as Algorithm=4 for the
Leiden algorithm, but you have to install Python and some Python
packages first. You can also try different resolution for more or less
clusters. See ??FindClusters for more details.
We can use t-SNE and UMAP to visualize the clustering results. The
argument group.by='seurat_clusters' is used to color the
cells according to the clustering results.
tSNE visualisation:
DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'seurat_clusters')
UMAP visualisation:
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')
Discussion 1: Which visualization is better, and why?
Solution here
According to the UMAP representation above, it seems that cluster is different from the other cells and cell clusters. Letâs dig a bit further to characterize this cluster.
First, we look for marker genes that were significantly differential
expressed in Cluster 3 compared with other clusters. We use the
FindAllMarkers() function to identify marker genes for each
cluster.
This command line takes some time to run:
pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)
We then extract the top 5 marker of cluster 3 with the smallest p-values:
cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5, ]
We then search these marker genes in this database website https://panglaodb.se/search.html. We need to remove the dash (â-â) before searching.
Based on the result provided by the database, it looks like cluster 3
might be a cluster of B cells. We can check whether this is the case by
checking the cell-type ground truth information pre-saved
in the pbmc_10x_v3 object, and plotting this into a
tSNE:
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'CellType')
Yes, we are correct, cluster 3 is annotated as B cells.
You can try to annotate other clusters by yourself. The process of annotating each cluster using marker genes is also known as manual cell type annotation. We will try to do it automatically in the next section.
â Ground-truth about cell-types Typically, cell-type information is unknown in single cell data, as this is what we are trying to find out! But publicly available datasets may have been previously annotated using the technique higlighted above, or using other reference datasets.
We can use violin plots to visualize the expression of one marker genes across all cell types.
For example, we choose the most significant marker genes from Cluster 4.
cluster4.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==4),]
cluster4.markers[1:5, ]
The most significant marker gene of Cluster 4 is VCAN.
If we want to visualize a marker gene across all cell types, we can
use the Idents() function, specifying that
CellType should be shown on the x-axios before using the
VlnPlot() function to draw the violin plot:
Idents(object = pbmc_10x_v3) <- "CellType"
VlnPlot(pbmc_10x_v3, features = 'VCAN')
Exercise 3: What is the interpretation of this output?
pbmc_10x_v2Assume we have all cell type annotation for pbmc_10x_v3
and that we have a new set of data (stored in pbmc_10x_v2
that we wish to annotate automatically.
Remember: pbmc_10x_v2 and
pbmc_10x_v3 include different cells from different patients
and sequenced using platforms.
pbmc_10x_v2Exercise 4: use the functions presented in Section 2 to normalise, select highly variable genes, scale, run a PCA and visualize the data using UMAP.
pbmc_10x_v3 as a reference using SeuratIn Seurat we can learn cell type annotation results from one scRNA-seq data to provide cell type annotation for another scRNA-seq dataset.
We use the FindTransferAnchors() function to predict
which cells in two datasets are of the same cell type. Here we use
pbmc_10x_v3 as the reference data set, and
pbmc_10x_v2 is the query data.
We specify that we use the top 30 PCs:
anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2,
dims = 1:30)
We then assign the cell-type of the cells in pbmc_10x_v2
using the TransferData() function:
predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType,
dims = 1:30)
Seurat will provide a table with the most likely cell type and the
probability of each cell type. We assign the most likely cell type to
the pbmc_10x_v2 object:
pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id
We then use UMAP to visualize this annotation:
DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType_Prediction')
Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.
The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file, we retain only the useful information for cell type annotation with the following command lines:
DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
saveRDS(pbmc_10x_v2_simple, 'pbmc_10x_v2.Rds')
An Rds file called pbmc_10x_v2.Rds is saved in your
working directory. You can check where is your working directory by
using the getwd() function.
Exercise 5 Open the Azimuth website: https://azimuth.hubmapconsortium.org/.
For cell type annotation (see also the slides here):
Find âReferences for scRNA-seq Queriesâ -> Then find âHuman - PBMCâ -> click âGo to Appâ
Click âBrowseâ -> find âpbmc_10x_v2.Rdsâ at your working directory -> Click âOpenâ
Waiting for the Rds file upload to the website
Click âMap cells to referenceâ
Click âDownload Resultsâ
Find âPredicted cell types and scores (TSV)â
Click âDownloadâ to get the cell type annotation result stored in azimuth_pred.tsv
Copy the tsv file (azimuth_pred.tsv) to your R working directory
The tsv file has the same data structure of Seurat annotation result
(predictions). We read the tsv file, then add the annotation result to
the pbmc_10x_v2 object with the AddMetaData()
function:
azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)
We use UMAP to visualize the cell type annotation result from Azimuth.
DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'predicted.celltype.l2')
Here is the cell type annotation results provided by the data provider:
DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType')
Exercise 6 Assuming the ground truth of the cell type annotation is from the provider, which cell type annotation is better from Seurat or Azimuth? Why?
Solution here
We have combined the two data sets without any processing in the
Seurat object pbmc_combo. In this section we will focus on
combining these data sets (note: the process would be similar if you are
combining different patients).
pbmc_combo
directly?Exercise 7: use the functions presented in Section 2
to normalise, select highly variable genes, scale, run a PCA and
visualize the data using UMAP on pbmc_combo. Highlight the
issues in this basic analysis where we are combining two independent
data sets.
In Seurat, we use the FindIntegrationAnchors() function
to identify cells with similar biological information between two data
sets. The difference between cells in two data sets with similar
biological information is considered as batch effect:
anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)
We then use the IntegrateData() function to remove batch
effects and integrate the two data sets:
pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)
Exercise 8: Below is the code to visualize the batch corrected data using UMAP (we need to scale the data first). What do you observe?
# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)
# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))
# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Method')
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Experiment')
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'CellType')
Solution here